{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "605a0ae1-62ec-418c-a29b-b0cf24e01407",
   "metadata": {},
   "source": [
    "**** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:\n",
    "```\n",
    "    make venv \n",
    "    source venv/bin/activate \n",
    "    pip install jupyterlab\n",
    "    venv/bin/jupyter lab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a517b15a-759a-4758-87dc-d92431defd68",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "# This cell is here as a reference only\n",
    "# Users and application developers must use the tag of the latest release on PyPI\n",
    "%pip install 'data-prep-toolkit-transforms[bloom]'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dce85474-5fc8-4aab-9265-65c5fe5bc228",
   "metadata": {},
   "source": [
    "**** Configure the transform parameters. The set of dictionary keys holding BloomTransform configuration for values are as follows:\n",
    " - model_name_or_path - specify the GneissWeb Bloom filter model, which can be sourced from HuggingFace. We use a toy model for demonstration purposes.\n",
    " - annotation_column - the column name containing binary score in the output .parquet file. Defaults to is_in_GneissWeb.\n",
    " - doc_text_column - the column name containing the document text in the input .parquet file. Defaults to contents.\n",
    " - batch_size - modify it based on the infrastructure capacity. Defaults to 1000.\n",
    "\n",
    "***** Import required classes and modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c2a77f37-147a-401b-8007-4ce1c6f44c60",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_bloom.transform_python import BLOOM"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "950cc5c4-117e-4bd2-8299-3de0be8072fe",
   "metadata": {},
   "source": [
    "***** Setup runtime parameters for this transform and invoke transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acf28558-dcd0-4b58-989b-65175eeacc07",
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure the transform with its runtime parameters and run it\n",
    "BLOOM(\n",
    "    input_folder=\"test-data/input\",\n",
    "    output_folder=\"output\",\n",
    "    model_name_or_path=\"bf.bloom\",\n",
    "    annotation_column=\"is_in_GneissWeb\",\n",
    "    doc_text_column=\"contents\",\n",
    "    inference_engine=\"CPU\",\n",
    "    batch_size=1000,\n",
    ").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f933e2d4-f9e0-4c92-8560-214e9ff5d724",
   "metadata": {},
   "source": [
    "**** The specified folder will include the transformed parquet files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "4ca0d2ea-7b2f-4628-a787-fd94dbf3704d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['output/metadata.json', 'output/test1.parquet']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# list the files written to the output folder\n",
    "import glob\n",
    "glob.glob(\"output/*\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
